192 research outputs found

    Multiplication by rational constants: LIP research report 2011-3

    Get PDF
    International audienceMultiplications by simple rational constants often appear in fixed-point or floating-point application code, for instance in the form of division by an integer constant. The hardware implementation of such operations is of practical interest to FPGA-accelerated computing. It is well known that the binary representation of rational constants is eventually periodic. This article shows how this feature can be exploited to implement multiplication by a rational constant in a number of additions that is logarithmic in the precision. An open-source implementation of these techniques is provided, and is shown to be practically relevant for constants with small numerators and denominators, where it provides improvements of 20 to 40\% in area with respect to the state of the art. It is also shown that for such constants, the additional cost for a correctly rounded result is very small, and that correct rounding very often comes for free in practice

    The Price of Routing in FPGAs

    Get PDF
    Among integrated circuits, field programmable gate arrays (FPGAs) may be the most spectacular benefactors of the steady progress of very large scale integration (VLSI) technology in the last two decades: current FPGA chips offer up to one million programmable gates. A close look at recent families of mainstream FPGAs, however, shows that the amount of silicon dedicated to routing increases much faster than that devoted to the gates themselves. The main reason for this is the fact that the routing needs to be reconfigurable. A simple quantitative analysis shows that this trend, if pursued, will lead to a widening gap between FPGA performance and full-cust- om VLSI performance. Some prospective solutions to this problem are exposed and discussed

    Certifying floating-point implementations using Gappa

    Full text link
    High confidence in floating-point programs requires proving numerical properties of final and intermediate values. One may need to guarantee that a value stays within some range, or that the error relative to some ideal value is well bounded. Such work may require several lines of proof for each line of code, and will usually be broken by the smallest change to the code (e.g. for maintenance or optimization purpose). Certifying these programs by hand is therefore very tedious and error-prone. This article discusses the use of the Gappa proof assistant in this context. Gappa has two main advantages over previous approaches: Its input format is very close to the actual C code to validate, and it automates error evaluation and propagation using interval arithmetic. Besides, it can be used to incrementally prove complex mathematical properties pertaining to the C code. Yet it does not require any specific knowledge about automatic theorem proving, and thus is accessible to a wide community. Moreover, Gappa may generate a formal proof of the results that can be checked independently by a lower-level proof assistant like Coq, hence providing an even higher confidence in the certification of the numerical code. The article demonstrates the use of this tool on a real-size example, an elementary function with correctly rounded output

    Conception d'une matrice reconfigurable pour coprocesseur fortement couplé

    Get PDF
    National audienceCet article étudie la conception d'un opérateur reconfigurable à grain moyen fortement couplé, intégré dans un \textit{System on Chip} (Soc) haute performance et faible consommation. Nous présentons un environnement logiciel en cours de développement destiné à l'exploration architecturale d'une telle solution. L'architecture reconfigurable, très paramétrique, se compose d'une matrice de cellules à grain moyen plongée dans un réseau configurable orienté flot de donnée. Elle est associée à un compilateur qui génère la configuration à partir d'un programme en syntaxe C ou VHDL. Le fonctionnement de cet environnement est illustré sur 3 exemples : l'encryption AES, le corps de boucle d'une fonction de hachage (SHA-1) et un filtre à réponse impulsionnelle finie (FIR)

    Floating-point exponential functions for DSP-enabled FPGAs

    Get PDF
    International audienceThis article presents a floating-point exponential operator generator targeting recent FPGAs with embedded memories and DSP blocks. A single-precision operator consumes just one DSP block, 18Kbits of dual-port memory, and 392 slices on Virtex-4. For larger precisions, a generic approach based on polynomial approximation is used and proves more resource-efficient than the literature. For instance a double-precision operator consumes 5 BlockRAM and 12 DSP48 blocks on Virtex-5, or 10 M9k and 22 18x18 multipliers on Stratix III. This approach is flexible, scales well beyond double-precision, and enables frequencies close to the FPGA's nominal frequency. All the proposed architectures are last-bit accurate for all the floating-point range.They are available in the open-source FloPoCo framework

    Fonctions élémentaires en virgule flottante pour les accélérateurs reconfigurables

    Get PDF
    National audienceLes circuits reconfigurables FPGA ont désormais une capacité telle qu'ils peuvent être utilisés à des tâches d'accélération de calcul en virgule flottante. La littérature (et depuis peu les constructeurs) proposent des opérateurs pour les quatre opérations. L'étape suivante est de proposer des opérateurs pour les fonctions élémentaires les plus utilisées. Parmi celles-ci, nous proposons des architectures dédiées pour l'évaluation des fonctions exponentielles, logarithme, sinus et cosinus, et étudions les compromis possibles. Pour chacune de ces fonctions, un seul de ces opérateurs surpasse d'un facteur dix les processeurs généralistes en terme de débit, tout en occupant une fraction des ressources matérielles du FPGA. Tous ces opérateurs sont disponibles librement sur http://www.ens-lyon.fr/LIP/Arenaire/

    Multi-operand Decimal Adder Trees for FPGAs

    Get PDF
    The research and development of hardware designs for decimal arithmetic is currently going under an intense activity. For most part, the methods proposed to implement fixed and floating point operations are intended for ASIC designs. Thus, a direct mapping or adaptation of these techniques into a FPGA could be far from an optimal solution. Only a few studies have considered new methods more suitable for FPGA implementations. A basic operation that has not received enough attention in this context is multi-operand BCD addition. For example, it is of interest for low latency implementations of decimal fixed and floating point multipliers and decimal fused multiply-add units. We have explored the most representative proposals for multi-operand BCD addition and found that the resultant implementations in FPGAs are still very inefficient in terms of both area and latency when compared to their binary counterparts. In this paper we present a new method for fast and efficient implementation of multi-operand BCD addition in current FPGA devices. In particular, our proposal maps quite well into the slice structure of the Xilinx Virtex-5/Virtex-6 families and it is highly pipelineable. The synthesis results for a Virtex-6 device indicate that our implementations halve the area and latency of previous proposals, presenting area and delay figures close to those of optimal binary adder trees.La recherche sur l'implantation en matériel de l'arithmétique décimale est actuellement très active, la plupart des travaux portant sur des opérateurs pour les processeurs, en virgule fixe ou flottante. Mais les techniques développées pour un circuit intégré n'aboutissent pas forcément à une implémentation optimale dans un FPGA. Il n'y a que peu d'études ciblant explicitement les FPGA. Cet article s'intéresse dans ce contexte, à l'addition BCD multi-opérande, au cœur de multiplieurs et de multiplieurs-accumulateurs à faible latence. Nous étudions les architectures proposées pour cette opération décimale, et nous observons que, sur FPGA, leur performance (surface et latence) est très inférieure à celle des opérations binaire à précision comparable. Nous présentons donc dans cet article une nouvelle technique d'addition BCD multi-opérandes qui s'avère plus efficace que les propositions précédentes sur les FPGA actuels. Elle s'adapte particulièrement bien à la structure fine des FPGA Xilinx Virtex-5/Virtex-6, et se prête bien au pipeline. Les résultats de synthèse montrent que notre implémentation divise par deux la surface et la latence par rapport aux propositions précédentes, les ramenant à des valeurs comparables à celles des meilleurs additionneurs multi-opérandes binaires

    Large multipliers with less DSP blocks

    Get PDF
    International audienceRecent computing-oriented FPGAs feature DSP blocks including small embedded multipliers. A large integer multiplier, for instance for a double-precision floating-point multiplier, consumes many of these DSP blocks. This article studies three non-standard implementation techniques of large multipliers: the Karatsuba-Ofman algorithm, non-standard multiplier tiling, and specialized squarers. They allow for large multipliers working at the peak frequency of the DSP blocks while reducing the DSP block usage. Their overhead in term of logic resources, if any, is much lower than that of emulating embedded multipliers. Their latency overhead, if any, is very small. Complete algorithmic descriptions are provided, carefully mapped on recent Xilinx and Altera devices, and validated by synthesis results

    Multipartite table methods

    Get PDF
    International audienceA unified view of most previous table-lookup-and-addition methods (bipartite tables, SBTM, STAM, and multipartite methods) is presented. This unified view allows a more accurate computation of the error entailed by these methods, which enables a wider design space exploration, leading to tables smaller than the best previously published ones by up to 50 percent. The synthesis of these multipartite architectures on Virtex FPGAs is also discussed. Compared to other methods involving multipliers, the multipartite approach offers the best speed/area tradeoff for precisions up to 16 bits. A reference implementation is available at www.ens-lyon.fr/LIP/Arenaire/

    Optimizing polynomials for floating-point implementation

    Get PDF
    The floating-point implementation of a function on an interval often reduces to polynomial approximation, the polynomial being typically provided by Remez algorithm. However, the floating-point evaluation of a Remez polynomial sometimes leads to catastrophic cancellations. This happens when some of the polynomial coefficients are very small in magnitude with respects to others. In this case, it is better to force these coefficients to zero, which also reduces the operation count. This technique, classically used for odd or even functions, may be generalized to a much larger class of functions. An algorithm is presented that forces to zero the smaller coefficients of the initial polynomial thanks to a modified Remez algorithm targeting an incomplete monomial basis. One advantage of this technique is that it is purely numerical, the function being used as a numerical black box. This algorithm is implemented within a larger polynomial implementation tool that is demonstrated on a range of examples, resulting in polynomials with less coefficients than those obtained the usual way.Comment: 12 page
    • …
    corecore